TL/UCP: Update topo aware ring algorithm #1288
Conversation
6c2242e to
843043c
Compare
|
/build |
|
@Juee14Desai @janjust I assume this will replace PR #1258 ? |
|
We talked about this, it shouldn't need to. If they are both good to go then let's have both ring and topo-aware ring and we can phase one out as needed. |
b80e124 to
9472c37
Compare
bd96bad to
e827813
Compare
|
/build |
|
| Filename | Overview |
|---|---|
| src/components/tl/ucp/tl_ucp_coll.c | Wires up the new reduce_scatter dynamic score string correctly, but the allreduce ring algorithm has no corresponding str_get_fn — the ring allreduce is never auto-selected for CUDA workloads. |
| src/components/tl/ucp/allgather/allgather_ring.c | Rewrites allgather ring to use the cuda_ring topology pattern with up to 8 parallel rings; start counter sentinel (1) and progress loop logic are internally consistent, but count-divisibility is not checked (flagged in previous review). |
| src/components/tl/ucp/reduce_scatter/reduce_scatter_ring.c | Complete rewrite to topology-aware multi-ring; non-in-place path still corrupts source buffer (flagged in previous review); ping-pong scratch for persistent is correct; missing send==recv symmetry assertion. |
| src/components/tl/ucp/allreduce/allreduce_ring.c | New schedule-based allreduce ring that chains RS+AG sub-tasks; functionally correct for in-place CUDA workloads, but never auto-selected because no dynamic score string was added to tl_ucp_coll.c; also lacks early cuda_ring guard before schedule allocation. |
| src/components/tl/ucp/tl_ucp_service_coll.c | Extracts a dedicated service_allgather_ring_start/progress so internal collectives continue using the flat ring; function pointers are correctly initialized before use. |
Reviews (6): Last reviewed commit: "TL/UCP: topo aware ring algo for reduce_..." | Re-trigger Greptile
|
|
||
| send_idx = ucc_ring_pattern_get_send_block(ring, ring_id, | ||
| rrank, step); | ||
| ring_offset = ucc_buffer_block_offset(block_cnt, nrings, ring_id); |
There was a problem hiding this comment.
Missing
send_posted == recv_posted assertion
The allgather ring progress function asserts task->tagged.send_posted == task->tagged.recv_posted at the top of its loop (since sends and recvs are always posted in pairs). The allreduce ring's RS-phase loop has the send_posted > 0 / recv_posted > 0 guards but omits the equality check, making it harder to catch a bookkeeping bug early. Consider adding the same assertion for consistency and diagnosability.
e827813 to
282b683
Compare
| reduce_target = PTR_OFFSET( | ||
| sbuf, | ||
| (recv_block * block_cnt + ring_offset) * dt_size); |
There was a problem hiding this comment.
Non-in-place operation silently corrupts the source buffer
For non-in-place reduce_scatter, sbuf = args->src.info.buffer. On every non-final step the reduction writes back into that same buffer:
reduce_target = PTR_OFFSET(sbuf, (recv_block * block_cnt + ring_offset) * dt_size);The subsequent send loop then reads from the now-mutated sbuf location to forward the intermediate partial sum. The old implementation accumulated partial results in the dedicated s_scratch send buffer and left sbuf read-only. Callers with separate source and destination buffers (i.e., any non-in-place reduce_scatter over CUDA memory when cuda_ring is present) will have their source data silently overwritten.
The simplest safe fix is to reject non-in-place in reduce_scatter_ring_init_common (analogous to the UCC_IS_PERSISTENT guard already there):
if (!UCC_IS_INPLACE(*args)) {
return UCC_ERR_NOT_SUPPORTED;
}Or alternatively, allocate a separate per-ring work buffer for intermediate reductions instead of reusing sbuf.
|
/build |
2 similar comments
|
/build |
|
/build |
282b683 to
453ac19
Compare
|
/build |
453ac19 to
e85aca1
Compare
|
/build |
|
/build |
| } | ||
| } | ||
|
|
||
| if (algo_num == UCC_TL_UCP_ALLGATHER_ALG_RING && !team->cuda_ring) { |
There was a problem hiding this comment.
Algorithm regression for odd-size non-CUDA teams
When cuda_ring is NULL (any CPU team, or GPU team without NVLink topology), odd-size teams previously used the flat RING algorithm. After this guard, they fall through to KNOMIAL. This silently changes the default for every odd-size non-CUDA workload, including large CPU clusters.
The underlying reason is that allgather_ring_init_common now hard-fails with UCC_ERR_NOT_SUPPORTED when cuda_ring == NULL, so the score-string must avoid selecting it. However, the correct fallback for the "no-topology-info" case is still the flat ring (old behavior), not knomial. Consider keeping the flat-ring algorithm alive under a separate name/enum and using it as the fallback, or only switching to KNOMIAL when the caller is guaranteed to benefit.
Replace the default ring allgather with a topo aware multi ring implementation that uses team->cuda_ring to route data along NVLink optimal paths (up to 8 parallel rings). Algorithm changes: - Ring rank, peer, and block indices are now derived from the cuda_ring topology pattern instead of flat team rank ordering. - Each ring transfers its own slice of each block, enabling concurrent data movement across multiple NVLink paths. - Algorithm auto selected for CUDA memory >4KB when cuda_ring is available; falls back to knomial otherwise. Also fixes CUDA primary context detection in ucc_sysinfo_cuda.c and decouples the service allgather from the topo aware ring. Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
Replace the default ring reduce_scatter with a topo aware multi ring implementation that uses team->cuda_ring to route data along NVLink optimal paths (up to 8 parallel rings). Algorithm changes: - Ring rank, peer, and block indices are now derived from the cuda_ring topology pattern instead of flat team rank ordering. - Each ring handles its own sub block slice, with per ring GPU reductions via the executor before forwarding to the next peer. - Scratch buffer management simplified to a single mc_alloc/free per task lifetime (removed fragmentation logic). Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
wfaderhold21
left a comment
There was a problem hiding this comment.
If I understand correct, this PR is removing CPU-based allgather/reduce_scatter ring algorithms. We will need a future PR to bring those back before 1.9.0 release.
| ucc_rank_t tsize = ucc_ring_pattern_size(ring, 0); | ||
| ucc_rank_t block = UCC_TL_TEAM_RANK(team); | ||
| size_t data_size = (count / tsize) * ucc_dt_size(dt); | ||
| ucc_status_t status; |
| @@ -1,5 +1,5 @@ | |||
| /** | |||
| * Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
| * Copyright (c) 2021-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
| if (!UCC_IS_INPLACE(*args)) { | ||
| status = ucc_mc_memcpy(PTR_OFFSET(rbuf, data_size * block), | ||
| sbuf, data_size, rmem, smem); | ||
| sbuf, data_size, rmem, smem); |
| @@ -1,424 +1,308 @@ | |||
| /** | |||
| * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
| * Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
| step - 1); | ||
| data_displ = (send_block * block_cnt + ring_offset) * dt_size; | ||
| send_src = PTR_OFFSET(sbuf, data_displ); | ||
| next_recv_dst = PTR_OFFSET(scratch, ring_offset * dt_size); |
| void *scratch = task->reduce_scatter_ring.scratch; | ||
| ucc_rank_t ring_id, rrank, adj_rrank, send_block, sendto, recvfrom; | ||
| size_t ring_offset, ring_count, data_displ, data_size; | ||
| ucc_status_t status; |
| ucc_tl_ucp_team_t *tl_team; | ||
| ucc_status_t status; | ||
| ucc_tl_ucp_team_t *team = TASK_TEAM(task); | ||
| ucc_coll_args_t *args = &TASK_ARGS(task); |
| size_t count = TASK_ARGS(task).dst.info.count; | ||
| size_t data_size = (count / tsize) * ucc_dt_size(TASK_ARGS(task).dst.info.datatype); | ||
| ucc_rank_t sendto, recvfrom, sblock, rblock; | ||
| int step; |
What
Add topology aware multi ring algorithms for allgather, reduce_scatter, and allreduce in TL/UCP. The ring algorithms use team->cuda_ring to route data along NVLink optimal paths with up to 8 parallel rings, instead of the default single ring.
Why ?
The default ring algorithms use a flat rank ordering that does not account for the underlying GPU interconnect topology. On multi GPU systems with NVLink, this results in suboptimal data routing transfers may traverse slower paths instead of direct NVLink links.
How ?
Allgather:
Reduce_scatter:
Allreduce: